Backport of Use strict DNS for mesh gateways with hostnames into release/1.16.x #19395
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Backport
This PR is auto-generated from #19268 to be assessed for backporting due to the inclusion of the label backport/1.16.
🚨
The person who merged in the original PR is:
@andrewstucki
This person should manually cherry-pick the original PR into a new backport PR,
and close this one when the manual backport PR is merged in.
The below text is copied from the body of the original PR.
Description
This fixes #17557. In an attempt to support mesh gateways fronted by AWS load balancers, a code path for peered mesh gateways was introduced in #14917 that leverages envoy clusters backed by the
LOGICAL_DNS
cluster discovery type. Problematically, when replicas of mesh gateways exist in a peered connection, the dialing peer will hit this code path and attempt to add multiple endpoints for the targeted mesh gateways. Envoy, however, doesn't support multiple endpoints usingLOGICAL_DNS
and will start spitting out errors applying the xDS it receives from Consul.In Kubernetes, when a mesh gateway restarts then, it will never finish initializing and get marked as healthy, so its pod will continually restart and the gateway becomes unusable.
Because this requires using hostnames rather than IP addresses for the WAN addresses registered for mesh gateways, it likely impacts mostly Consul users on AWS, where hostnames are used for LoadBalancer services and thus registered for LoadBalancer type mesh gateways. This will also affect users who manually (or with annotations) register mesh gateways with mutiple FQDNs.
Note that this appears to only affect the dialing cluster in a peered connection, the accepting clusters use a different code path that only ever uses a single mesh gateway target and doesn't attempt to load-balance between multiple mesh gateways.
Testing & Reproduction steps
I was able to recreate this pretty easily outside of AWS by pinning the FQDN of the mesh gateways in the accepting cluster via something like:
which gives this for my dialing cluster:
And dialing cluster Consul logs then show:
and dialing cluster mesh gateway:
Swapping to
STRICT_DNS
allows the mesh gateway to finish configuration and boot properly.Links
Strict DNS in envoy.
PR Checklist
Overview of commits